On the Characteristics of Language Tags on the Web

Author

  • Joel Sommers
Abstract

The Internet is a global phenomenon. To support broad use of Internet applications such as the World Wide Web, character encodings have been developed for many scripts of the world’s languages and there are standard mechanisms for indicating that content is in a particular language and/or tailored to a particular region. In this paper we study the empirical characteristics of language tags used in HTTP transactions and in web pages to indicate the language of the content and possibly the script, region, and other information. To support our analysis, we develop a new algorithm to infer the value of a missing language tag for elements used to link to alternative language content. We analyze the top-level page for websites in the Alexa Top 1 Million, from six geographic perspectives. We find that one third of all pages do not include any language tags, that half of the remaining sites are tagged with English (en), and that about 10K sites have malformed tags. We observe that 80K sites are multilingual, and that there are hundreds of sites that offer content in tens of languages. Besides malformed tags, we find numerous instances of correctly formed but likely erroneous language tags by using a Naïve Bayes-based language detection library and comparing its output with a given page’s language tag(s). Lastly, we comment on differences in language tags observed for the same site but from different geographic vantage points or by using different client language preferences via the HTTP Accept-Language header.
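To make the notion of language tags in web pages concrete, the following is a minimal Python sketch of how such tags might be collected from a page and checked for well-formedness. It is not the paper's method: the names `LangTagParser`, `extract_language_tags`, and `is_well_formed` are hypothetical, the regular expression is only a simplified approximation of the RFC 5646 `langtag` grammar, and the sketch does not attempt the paper's algorithm for inferring missing tags.

```python
import re
from html.parser import HTMLParser

# Simplified approximation of the RFC 5646 "langtag" grammar. Assumption:
# grandfathered tags, extended language subtags, and private-use-only tags
# such as "x-default" are deliberately left out of this sketch.
LANGTAG_RE = re.compile(
    r"^(?:[a-z]{2,3}|[a-z]{5,8})"                 # primary language subtag
    r"(?:-[a-z]{4})?"                             # optional script, e.g. Hans
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"                # optional region, e.g. US, 419
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"   # variants, e.g. 1996
    r"(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*"       # extensions
    r"(?:-x(?:-[a-z0-9]{1,8})+)?$",               # trailing private use
    re.IGNORECASE,
)


class LangTagParser(HTMLParser):
    """Collects <html lang> and <link rel="alternate" hreflang> values."""

    def __init__(self):
        super().__init__()
        self.page_lang = None   # value of <html lang="...">, if present
        self.alternates = []    # (hreflang, href) pairs, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.page_lang is None:
            self.page_lang = attrs.get("lang")
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "alternate" in rel:
                # hreflang may be None when the link carries no language tag
                self.alternates.append((attrs.get("hreflang"), attrs.get("href")))


def extract_language_tags(html_text):
    """Return (page_lang, alternates) for one HTML document."""
    parser = LangTagParser()
    parser.feed(html_text)
    parser.close()
    return parser.page_lang, parser.alternates


def is_well_formed(tag):
    """True if the tag matches the simplified grammar above."""
    return bool(tag) and LANGTAG_RE.match(tag.strip()) is not None


SAMPLE = """<html lang="en-US"><head>
<link rel="alternate" hreflang="fr" href="/fr/">
<link rel="alternate" hreflang="en_GB" href="/uk/">
</head><body></body></html>"""

page_lang, alternates = extract_language_tags(SAMPLE)
for tag, href in alternates:
    print(href, tag, "well-formed" if is_well_formed(tag) else "malformed")
```

In this sample, the underscore in `en_GB` fails the check, which illustrates one kind of malformed tag. A tag that passes is only syntactically well-formed; it may still describe the wrong language, which is the separate issue the abstract addresses with a language detection library.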

Similar resources

A Study of Language Consistency among Indexers, Authors, and Taggers in the ERIC and Mendeley Databases

Objective: The purpose of this study was to identify the language consistency between indexers, authors and taggers in the ERIC and Mendeley databases. Methodology: This survey was conducted using content analysis methods and techniques to evaluate the language consistency between indexers, authors and taggers in the ERIC and Mendeley databases and also to determine common keywords. The sample ...

Conformity of Metadata Elements on the Websites of Central Libraries of Medical Sciences Universities with Dublin Core Metadata Elements

Introduction: Considering the importance of library websites in the establishment of communication and provision of services for their users, it is crucial to include those features in these websites which can lead to increased dynamism and optimal communication. The present study aimed at comparing Metadata elements of Dublin Core with those of the websites of Central Libraries of Medical Univ...

A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...

A Survey of Meta Tags in the Structure of Websites of Central Libraries of Iranian Medical Sciences Universities

Introduction: One of the recommended ways of organizing information on websites is the application of Meta Tags. Applying a variety of Meta Tags can affect the precision of search engine retrieval and can also improve a website's ranking. The purpose of the study was to investigate the structure of library websites based on Meta Tags in medical science univers...

An Executive Approach Based On the Production of Fuzzy Ontology Using the Semantic Web Rule Language Method (SWRL)

Today, the need to deal with ambiguous information in semantic web languages is increasing. Ontology is an important part of the W3C standards for the semantic web, used to define a conceptual standard vocabulary for the exchange of data between systems, the provision of reusable databases, and the facilitation of collaboration across multiple systems. However, classical ontology is not enough ...

Publication date: 2018